Abstract

Topic modeling of textual corpora is an important and challenging
problem. Most previous work makes the “bag-of-words” assumption,
which ignores the ordering of words. This assumption simplifies the
computation, but it unrealistically discards the ordering information
and the semantics of words in context. In this paper, we present a
Gaussian Mixture Neural Topic Model (GMNTM) which incorporates both
the ordering of words and the semantic meaning of sentences into topic
modeling. Specifically, we represent each topic as a cluster of
multi-dimensional vectors and embed the corpus into a collection of
vectors generated by the Gaussian mixture model. Each word is affected
not only by its topic, but also by the embedding vectors of its
surrounding words and the context. The Gaussian mixture components and
the topics of documents, sentences and words can be learnt jointly.
Extensive experiments show that our model can learn better topics and
more accurate word distributions for each topic. Quantitatively,
compared with state-of-the-art topic modeling approaches, GMNTM obtains
significantly better performance in terms of perplexity, retrieval
accuracy and classification accuracy.

Introduction 

With the growth of large collections of electronic text, much
attention has been given to topic modeling of textual corpora, which
is designed to identify representations of the data and learn thematic
structure from large document collections without human supervision.
Topic models have been applied to a variety of applications, including
information retrieval (Wei and Croft 2006), collaborative filtering
(Marlin 2003), authorship identification (Rosen-Zvi et al. 2004) and
opinion extraction (Lin et al. 2012). Existing topic models (Griffiths
and Tenenbaum 2004; Mcauliffe and Blei 2008; Blei 2012) are built on
the assumption that each document is represented by a mixture of
topics, where each topic defines a probability distribution over
words. These models, including the probabilistic latent semantic
analysis (PLSA) model (Hofmann 1999) and the latent Dirichlet
allocation (LDA) model (Blei, Ng, and Jordan 2003), can be viewed as
graphical models with latent variables. Some nonparametric extensions
to these models have also been quite successful (Teh et al. 2006;
Steyvers and Griffiths 2007).


Nevertheless, exact inference for these models is computationally
hard, so one has to resort to slow or inaccurate approximations to
compute the posterior distribution over topics. New undirected
graphical model approaches, including the Replicated Softmax model
(Hinton and Salakhutdinov 2009), have also been successfully applied
to exploring the topics of text, and in particular cases they
outperform LDA (Srivastava, Salakhutdinov, and Hinton 2013).

A major limitation of these topic models and many of their extensions
is the bag-of-words assumption, which assumes that a document can be
fully characterized by bag-of-words features. This assumption is
favorable from a computational point of view, but it loses the
ordering of the words and cannot properly capture the semantics of the
context. For example, the phrases “the department chair couches
offers” and “the chair department offers couches” have the same
unigram statistics, but are about quite different topics. When
deciding which topic generated the word “chair” in the first sentence,
knowing that it was immediately preceded by the word “department”
makes it much more likely to have been generated by a topic that
assigns high probability to words related to university administration
(Wallach 2006).

There has been little work on developing topic models in which the
order of words is taken into consideration. To remove the assumption
that the order of words is negligible, Gruber, Weiss, and Rosen-Zvi
(2007) propose modeling the topics of words in the document via a
Markov chain. Wallach (2006) explores a hierarchical generative
probabilistic model that incorporates both n-gram statistics and
latent topic variables. Even though these models consider the order of
words to some extent, they are still not capable of characterizing the
semantics of words. For example, the integer representations of the
words “teacher” and “teach” are completely unrelated, even though we
know they have a strong semantic connection and very likely belong to
the same topic. To obtain a distributed representation of words that
captures semantic similarities, several Neural Probabilistic Language
Models (NPLMs) have been proposed (Mnih and Hinton 2009; Mnih and Teh
2012; Mnih and Kavukcuoglu 2013; Mikolov et al. 2013; Le and Mikolov
2014). Nevertheless, the dense word embeddings learned by previous
NPLMs cannot be directly interpreted as topics. This is because word
embeddings are usually considered opaque, in the sense that it is
difficult to assign meanings to the vector representation.

In this paper, we propose a novel topic model called the Gaussian
Mixture Neural Topic Model (GMNTM). The work is inspired by recent
neural probabilistic language models (Mnih and Hinton 2009; Mnih and
Teh 2012; Mnih and Kavukcuoglu 2013; Mikolov et al. 2013; Le and
Mikolov 2014). We represent the topic model as a Gaussian mixture
model of vectors which encode words, sentences and documents. Each
mixture component is associated with a specific topic. We present a
method that jointly learns the topic model and the vector
representations. As in NPLM methods, the word embeddings are learnt to
optimize the predictability of a word from its surrounding words, with
the important constraint that the vector representations are sampled
from the Gaussian mixture which represents topics. Because the
semantic meaning of sentences and documents is incorporated to infer
the topic of a specific word, in our model words with similar
semantics are more likely to be clustered into the same topic, and the
topics of sentences and documents are learned more accurately. The
model thus potentially overcomes the weaknesses of the bag-of-words
method and the bag-of-n-grams method, neither of which uses both the
order of words and the semantics of the context. We conduct
experiments to verify the effectiveness of the proposed model on two
widely used, publicly available datasets. The experimental results
show that our model substantially outperforms the state-of-the-art
models in terms of perplexity, document retrieval quality and document
classification accuracy.

Related works 

In the past decade, a great variety of topic models have been
proposed, which can automatically extract interesting topics from text
in the form of multinomial distributions (Blei, Ng, and Jordan 2003;
Griffiths and Tenenbaum 2004; Blei 2012; Gruber, Weiss, and Rosen-Zvi
2007; Hinton and Salakhutdinov 2009). Among these approaches, LDA
(Blei, Ng, and Jordan 2003) and its variants are the most popular
models for topic modeling. The mixture of topics per document in the
LDA model is generated from a Dirichlet prior shared by all documents
in the corpus. Different extensions of the LDA model have been
proposed. For example, Teh et al. (2006) assume that the number of
mixture components is unknown a priori and is to be inferred from the
data. Mcauliffe and Blei (2008) develop a supervised latent Dirichlet
allocation model (sLDA) for document-response pairs. Recent work
incorporates context information into topic modeling, such as time
(Wang and McCallum 2006), geographic location (Mei et al. 2006),
authorship (Steyvers et al. 2004), and sentiment (Yang et al. 2014b;
2014a), to make topic models fit expectations better.

Recently, several undirected graphical models have been proposed,
which typically outperform LDA. Hinton and Salakhutdinov (2009)
present a two-layer undirected graphical model, called “Replicated
Softmax”, that can be used to model and automatically extract
low-dimensional latent semantic representations from a large
unstructured collection of documents. Srivastava, Salakhutdinov, and
Hinton (2013) extend “Replicated Softmax” by adding another layer of
hidden units on top of the first with bipartite undirected
connections. Neural network based approaches, such as the Neural
Autoregressive Density Estimator for documents (DocNADE) (Larochelle
and Lauly 2012) and the Hybrid Neural Network-Latent Topic Model (Wan,
Zhu, and Fergus 2012), have also been shown to outperform the LDA
model.

However, all of these topic models employ the bag-of-words
assumption, which is rarely true in practice. The bag-of-words
assumption loses the ordering of the words and ignores the semantics
of the context. Several previous works take the order of words into
account. Wallach (2006) explores a hierarchical generative
probabilistic model that incorporates both n-gram statistics and
latent topic variables, extending a unigram topic model so that it can
reflect properties of a hierarchical Dirichlet bigram model. Gruber,
Weiss, and Rosen-Zvi (2007) propose modeling the topics of words as a
Markov chain. Florez and Nachman (2014) exploit the semantic
regularities captured by a Recurrent Neural Network (RNN) in text
documents to build a recommender system. Although these methods
capture the ordering of words, none of them models word semantics, so
they cannot capture the semantic similarity between words such as
“teach” and “teacher”. In contrast, our model is inspired by recent
work on learning vector representations of words, which have been
shown to capture the semantics of text (Mnih and Hinton 2009; Mnih and
Teh 2012; Mnih and Kavukcuoglu 2013; Mikolov et al. 2013; Le and
Mikolov 2014). Our topic model captures both the ordering of words and
the semantics of the context. As a consequence, semantically similar
words (e.g., “Jesus” and “Christ”) are more likely to have similar
topic distributions.

The GMNTM Model 

In this section, we first describe the GMNTM model as a
probabilistic generative model. Then we present the inference
algorithm for estimating the model parameters.

Generative model 

We assume there are $W$ different words in the vocabulary and $D$
documents in the corpus. For each word $w \in \{1, \dots, W\}$ in the
vocabulary, there is an associated $V$-dimensional vector
representation $\mathrm{vec}(w) \in \mathbb{R}^V$. Each document in
the corpus with index $d \in \{1, \dots, D\}$ also has a vector
representation $\mathrm{vec}(d) \in \mathbb{R}^V$. If the documents
contain $S$ sentences in total, then these sentences are indexed by
$s \in \{1, \dots, S\}$. The sentence with index $s$ is associated
with a vector representation $\mathrm{vec}(s) \in \mathbb{R}^V$.

There are $T$ topics in the GMNTM model, where $T$ is designated by
the user. Each topic corresponds to a Gaussian mixture component. The
$k$-th topic is represented by a $V$-dimensional Gaussian distribution
$\mathcal{N}(\mu_k, \Sigma_k)$ with mixture weight
$\pi_k \in \mathbb{R}$, where $\mu_k \in \mathbb{R}^V$,
$\Sigma_k \in \mathbb{R}^{V \times V}$, and
$\sum_{k=1}^{T} \pi_k = 1$. The parameters of the Gaussian mixture
model are collectively represented by

$$\lambda = \{\pi_k, \mu_k, \Sigma_k\}, \quad k = 1, \dots, T \qquad (1)$$

Given the collection of parameters, we use

$$p(x \mid \lambda) = \sum_{i=1}^{T} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i) \qquad (2)$$

to represent the probability distribution for sampling a vector
$x$ from the Gaussian mixture model.
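
As an illustration of equation (2), the following minimal sketch evaluates
the mixture density at a point. It assumes diagonal covariance matrices to
keep the example free of linear algebra; the function names and the toy
parameter values are hypothetical and are not part of the model
specification.

```python
import math

def gaussian_pdf_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (a simplifying assumption)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def mixture_pdf(x, weights, means, variances):
    """p(x | lambda) = sum_i pi_i * N(x | mu_i, Sigma_i), as in equation (2)."""
    return sum(w * gaussian_pdf_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy model with T = 2 topics in V = 2 dimensions (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
density_at_origin = mixture_pdf([0.0, 0.0], weights, means, variances)
```

At the origin the density is dominated by the first component, since the
second component is centred four units away in each dimension.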

We now describe how the corpus is generated. Given the Gaussian
mixture model $\lambda$, the generative process is as follows: for
each word $w$ in the vocabulary, we sample its topic $z(w)$ from the
multinomial distribution $\pi := (\pi_1, \pi_2, \dots, \pi_T)$ and
sample its vector representation $\mathrm{vec}(w)$ from the
distribution $\mathcal{N}(\mu_{z(w)}, \Sigma_{z(w)})$. Equivalently,
the vector $\mathrm{vec}(w)$ is sampled from the Gaussian mixture
model parameterized by $\lambda$. For each document $d$ and each
sentence $s$ in the document, we sample their topics $z(d)$, $z(s)$
from the distribution $\pi$ and sample their vector representations,
namely $\mathrm{vec}(d)$ and $\mathrm{vec}(s)$, also from the Gaussian
mixture model. Let $\Psi$ be the collection of latent vectors
associated with all the words, sentences and documents in the corpus:

$$\Psi := \{\mathrm{vec}(w)\} \cup \{\mathrm{vec}(d)\} \cup \{\mathrm{vec}(s)\} \qquad (3)$$
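
The sampling step of the generative process can be sketched as follows.
This is only an illustration under the assumption of diagonal covariances;
`sample_latent_vector` and the toy parameters are hypothetical names and
values introduced for the example.

```python
import random

def sample_latent_vector(rng, weights, means, variances):
    """Draw a topic z ~ Multinomial(pi), then a vector from N(mu_z, Sigma_z)."""
    z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Diagonal covariance assumed: each coordinate is sampled independently.
    vec = [rng.gauss(m, v ** 0.5) for m, v in zip(means[z], variances[z])]
    return z, vec

rng = random.Random(0)
weights = [0.7, 0.3]
means = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
variances = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
# One (topic, vector) pair would be drawn this way for every word,
# sentence and document in the corpus.
samples = [sample_latent_vector(rng, weights, means, variances) for _ in range(1000)]
topic_counts = [sum(1 for z, _ in samples if z == k) for k in range(2)]
```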


For each word slot in the sentence, its word realization is
generated according to the document’s vector $\mathrm{vec}(d)$, the
current sentence’s vector $\mathrm{vec}(s)$, and at most $m$ previous
words in the same sentence. Formally, for the $i$-th location in the
sentence, we represent its word realization by $w_i$. The probability
distribution of $w_i$ is defined by:

$$p(w_i = w \mid d, s, w_{i-m}, \dots, w_{i-1}) \propto \exp\Big(a^{w}_{doc} + a^{w}_{sen} + \sum_{t=1}^{m} a^{w}_{t} + b\Big) \qquad (4)$$


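where $a^{w}_{doc}$, $a^{w}_{sen}$ and $a^{w}_{t}$ are the influences
from the document, the sentence and the $t$-th previous word,
respectively. They are defined by

$$a^{w}_{doc} = \langle u^{w}_{doc}, \mathrm{vec}(d) \rangle \qquad (5)$$

$$a^{w}_{sen} = \langle u^{w}_{sen}, \mathrm{vec}(s) \rangle \qquad (6)$$

$$a^{w}_{t} = \langle u^{w}_{t}, \mathrm{vec}(w_{i-t}) \rangle \qquad (7)$$

Here, $u^{w}_{doc}, u^{w}_{sen}, u^{w}_{t} \in \mathbb{R}^V$ are
parameters of the model, and they are shared across all slots in the
corpus. We use $U$ to represent this collection of vectors,

$$U := \{u^{w}_{doc}, u^{w}_{sen}\} \cup \{u^{w}_{t} \mid t \in 1, 2, \dots, m\} \qquad (8)$$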


Combining the equations above, the probability distribution of $w_i$
is defined by a multi-class logistic model, where the features come
from the vectors associated with the document, the sentence and the
$m$ previous words. By estimating the model parameters, we learn the
word representations that make one word predictable from its previous
words and the context. Jointly, we learn the distribution of topics
that words, sentences and documents belong to.
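
The multi-class logistic model of equations (4)-(7) can be sketched as a
softmax over the vocabulary. The data layout below (`u_doc[w]`,
`u_sen[w]`, `u_prev[t][w]`) and the function name are assumptions made for
this example; a single shared bias `b` is used as in equation (4).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def next_word_distribution(u_doc, u_sen, u_prev, bias, vec_d, vec_s, prev_vecs):
    """Softmax over the vocabulary following equations (4)-(7).

    u_doc[w], u_sen[w]: parameter vectors for candidate word w.
    u_prev[t][w]: parameter vector pairing word w with the context word
    t + 1 positions back; prev_vecs[t] is that context word's vector.
    Fewer than m context words are handled by passing a shorter prev_vecs.
    """
    scores = []
    for w in range(len(u_doc)):
        a = dot(u_doc[w], vec_d) + dot(u_sen[w], vec_s) + bias
        a += sum(dot(u_prev[t][w], prev_vecs[t]) for t in range(len(prev_vecs)))
        scores.append(a)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setting: W = 3 words, V = 2 dimensions, m = 1 previous word.
u_doc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
u_sen = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
u_prev = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]
probs = next_word_distribution(u_doc, u_sen, u_prev, 0.0,
                               [2.0, 0.0], [0.0, 0.0], [[1.0, 1.0]])
```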

Given the model parameters and the vectors for documents, sentences
and words, we can infer the posterior probability distribution of
topics. In particular, for a document $d$ with vector representation
$\mathrm{vec}(d)$, the posterior distribution of its topic, namely
$q(z(d))$, is easy to calculate. For any $z \in \{1, 2, \dots, T\}$,
we have

$$q(z(d) = z) = \frac{\pi_z \, \mathcal{N}(\mathrm{vec}(d) \mid \mu_z, \Sigma_z)}{\sum_{k=1}^{T} \pi_k \, \mathcal{N}(\mathrm{vec}(d) \mid \mu_k, \Sigma_k)} \qquad (9)$$

Similarly, for each sentence $s$ in the document $d$, the posterior
distribution of its topic is

$$q(z(s) = z) = \frac{\pi_z \, \mathcal{N}(\mathrm{vec}(s) \mid \mu_z, \Sigma_z)}{\sum_{k=1}^{T} \pi_k \, \mathcal{N}(\mathrm{vec}(s) \mid \mu_k, \Sigma_k)} \qquad (10)$$

For each word $w$ in the vocabulary, the posterior distribution of its
topic is similarly calculated as

$$q(z(w) = z) = \frac{\pi_z \, \mathcal{N}(\mathrm{vec}(w) \mid \mu_z, \Sigma_z)}{\sum_{k=1}^{T} \pi_k \, \mathcal{N}(\mathrm{vec}(w) \mid \mu_k, \Sigma_k)} \qquad (11)$$

Finally, for each word slot in the document, we also want to infer its
topic. Since the topic of a particular location in the document is
affected by its word realization and by the sentence and document it
belongs to, we define the probability of it belonging to topic $z$ to
be proportional to the product of $q(z(w) = z)$, $q(z(s) = z)$ and
$q(z(d) = z)$, where $w$, $s$ and $d$ are the word, the sentence and
the document associated with this word slot.
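
The posteriors in equations (9)-(11) and the word-slot rule can be
sketched as below. Diagonal covariances are assumed for brevity, and the
function names and toy values are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def topic_posterior(vec, weights, means, variances):
    """q(z = k) proportional to pi_k * N(vec | mu_k, Sigma_k), equations (9)-(11)."""
    logs = [math.log(w) + log_gaussian_diag(vec, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # normalise in log space for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def word_slot_posterior(q_word, q_sentence, q_document):
    """Normalised product of the three posteriors for one word slot."""
    prod = [a * b * c for a, b, c in zip(q_word, q_sentence, q_document)]
    total = sum(prod)
    return [p / total for p in prod]

# Toy one-dimensional model with two topics centred at 0 and 2.
weights, means, variances = [0.5, 0.5], [[0.0], [2.0]], [[1.0], [1.0]]
q_w = topic_posterior([0.0], weights, means, variances)
q_s = topic_posterior([1.0], weights, means, variances)
q_d = topic_posterior([0.0], weights, means, variances)
q_slot = word_slot_posterior(q_w, q_s, q_d)
```

In this toy case the sentence vector is uninformative, while the word and
document agree on the first topic, so the word-slot posterior is more
peaked than either of them alone.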


Estimating model parameters 

We estimate the model parameters $\lambda$, $U$ and $\Psi$ by
maximizing the likelihood of the generative model. The parameter
estimation consists of two stages. In Stage I, we maximize the
likelihood of the model with respect to $\lambda$. Since $\lambda$
characterizes a Gaussian mixture model, this procedure can be
implemented by the Expectation Maximization (EM) algorithm. In Stage
II, we maximize the model likelihood with respect to $U$ and $\Psi$;
this procedure can be implemented by stochastic gradient descent. We
alternate between Stage I and Stage II until the parameters converge.
The procedure described in this section is summarized in Algorithm 1.

Stage I: Estimating $\lambda$. In this stage, the latent vectors of
words, sentences and documents are fixed. We estimate the parameters
of the Gaussian mixture model
$\lambda = \{\pi_k, \mu_k, \Sigma_k\}$. This is a classical
statistical estimation problem which can be solved by running the EM
algorithm. The reader can refer to (Bishop 2006) for the
implementation details.
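
For readers who want a concrete picture of Stage I, the following is a
minimal sketch of one EM iteration for a Gaussian mixture. It assumes
diagonal covariances and a variance floor (`min_var`), neither of which is
prescribed by the model; the function names and toy data are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def em_step(vectors, weights, means, variances, min_var=1e-6):
    """One EM iteration for a diagonal-covariance Gaussian mixture."""
    T, V, n = len(weights), len(vectors[0]), len(vectors)
    # E-step: responsibilities resp[j][k] = q(z_j = k) for every fixed vector.
    resp = []
    for x in vectors:
        logs = [math.log(weights[k]) + log_gaussian_diag(x, means[k], variances[k])
                for k in range(T)]
        top = max(logs)
        unnorm = [math.exp(l - top) for l in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    # M-step: re-estimate mixture weights, means and variances.
    new_w, new_m, new_v = [], [], []
    for k in range(T):
        nk = sum(r[k] for r in resp)
        mean = [sum(r[k] * x[d] for r, x in zip(resp, vectors)) / nk
                for d in range(V)]
        var = [max(sum(r[k] * (x[d] - mean[d]) ** 2
                       for r, x in zip(resp, vectors)) / nk, min_var)
               for d in range(V)]
        new_w.append(nk / n)
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v

# Toy data: two well-separated clusters of one-dimensional "latent vectors".
vectors = [[-0.1], [0.0], [0.1], [4.9], [5.0], [5.1]]
weights, means, variances = [0.5, 0.5], [[1.0], [4.0]], [[1.0], [1.0]]
for _ in range(20):
    weights, means, variances = em_step(vectors, weights, means, variances)
```

A practical implementation would also guard against components that
receive no responsibility and would monitor the log-likelihood to decide
when to stop.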

Stage II: Estimating $U$ and $\Psi$. When $\lambda$ is known and
fixed, we estimate the model parameters $U$ and the latent vectors
$\Psi$ by maximizing the log-likelihood of the generative model. In
particular, we iteratively sample a location in the corpus, and
consider the log-likelihood of the observed word at this location. Let
the word realization at location $i$ be represented by $w_i$. The
log-likelihood of this location is equal to

$$J_i(U, \Psi) = \log p(\Psi \mid \lambda) + a^{w_i}_{doc} + a^{w_i}_{sen} + \sum_{t=1}^{m} a^{w_i}_{t} + b - \log\Big(\sum_{w} \exp\Big(a^{w}_{doc} + a^{w}_{sen} + \sum_{t=1}^{m} a^{w}_{t} + b\Big)\Big) \qquad (12)$$



Algorithm 1 Inference Algorithm

• Inputs: A corpus containing D documents, S sentences,
  and a vocabulary containing W distinct words

• Initialize parameters

  - Randomly initialize the vectors $\Psi$.

  - Initialize parameters $U$ with all-zero vectors.

  - Initialize Gaussian mixture model parameters with the
    standard normal distribution $\mathcal{N}(0, \mathrm{diag}(1))$.

• Repeat until convergence

  - Fixing parameters $U$ and $\Psi$, run the EM algorithm to
    estimate the Gaussian mixture model parameters $\lambda$.

  - Fixing the Gaussian mixture model $\lambda$, run stochastic
    gradient descent to maximize the log-likelihood of the
    model with respect to parameters $U$ and $\Psi$.


where $p(\Psi \mid \lambda)$ is the prior distribution of the latent
vectors $\Psi$ under the Gaussian mixture model, defined by equation
(2). The quantities $a^{w}_{doc}$, $a^{w}_{sen}$ and $a^{w}_{t}$ are
defined in equations (5), (6), and (7). The objective function
$J_i(U, \Psi)$ involves all parameters in the collections
$(U, \Psi)$. For computational efficiency, we only update the
parameters associated with the word $w_i$. Concretely, we update

$$\mathrm{vec}(w_i) \leftarrow \mathrm{vec}(w_i) + \alpha \frac{\partial J_i(U, \Psi)}{\partial\, \mathrm{vec}(w_i)} \qquad (13)$$

$$u^{w_i}_{t} \leftarrow u^{w_i}_{t} + \alpha \frac{\partial J_i(U, \Psi)}{\partial u^{w_i}_{t}} \qquad (14)$$

with $\alpha$ as the learning rate. Similarly, we update
$\mathrm{vec}(s)$, $\mathrm{vec}(d)$ and $u^{w_i}_{doc}$,
$u^{w_i}_{sen}$ using the same gradient step, as they are parameters
associated with the current sentence and the current document. Once
the gradient update is completed, we sample another location and
continue. The procedure terminates when a sufficient number of updates
have been performed, so that both $U$ and $\Psi$ converge to fixed
values.
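
To make the update in equation (14) concrete, the sketch below performs one
gradient ascent step on the softmax part of the objective. It is
deliberately simplified: the document and sentence terms, the bias and the
prior term $\log p(\Psi \mid \lambda)$ are omitted (they follow the same
pattern), and the data layout `u_prev[t][w]` and the function names are
assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_prob_of_word(u_prev, prev_vecs, target):
    """log p(w_i = target | previous words), previous-word terms only."""
    scores = [sum(dot(u_prev[t][w], prev_vecs[t]) for t in range(len(prev_vecs)))
              for w in range(len(u_prev[0]))]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target] - log_norm

def sgd_update_observed_word(u_prev, prev_vecs, target, lr):
    """Equation (14): update u_t^{w_i} for the observed word w_i only."""
    p_target = math.exp(log_prob_of_word(u_prev, prev_vecs, target))
    for t, vec in enumerate(prev_vecs):
        # Gradient of the softmax log-likelihood with respect to u_t^{w_i}
        # is (1 - p(w_i | context)) times the context word's vector.
        u_prev[t][target] = [u + lr * (1.0 - p_target) * x
                             for u, x in zip(u_prev[t][target], vec)]

# Toy setting: W = 3 words, V = 2 dimensions, m = 2 previous words,
# with U initialised to all-zero vectors as in Algorithm 1.
W, V, m = 3, 2, 2
u_prev = [[[0.0] * V for _ in range(W)] for _ in range(m)]
prev_vecs = [[1.0, 0.5], [-0.5, 2.0]]
before = log_prob_of_word(u_prev, prev_vecs, target=1)
sgd_update_observed_word(u_prev, prev_vecs, target=1, lr=0.025)
after = log_prob_of_word(u_prev, prev_vecs, target=1)
```

Updating only the parameters of the observed word keeps each step cheap,
because the parameter vectors of the other vocabulary words are left
untouched.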


Experiments 

In this section, we evaluate our model on the 20 Newsgroups and the
Reuters Corpus Volume 1 (RCV1-v2) data sets. Following the evaluation
in (Srivastava, Salakhutdinov, and Hinton 2013), we compare our GMNTM
model with state-of-the-art topic models in terms of perplexity,
retrieval quality and classification accuracy.

Datasets description 

We adopt two widely used datasets, the 20 Newsgroups data and the
RCV1-v2 data, in our evaluations. Data preprocessing is performed on
both datasets. We first remove non-alphabetic characters, numbers,
pronouns, punctuation and stop words from the text. Then, stemming is
applied to reduce the vocabulary size and mitigate data sparseness.
The detailed properties of the datasets are described as follows.


20 Newsgroups dataset: This dataset is a collection of 18,845
newsgroup documents¹. The corpus is partitioned into 20 different
newsgroups, each corresponding to a separate topic. Following the
preprocessing in (Hinton and Salakhutdinov 2009) and (Larochelle and
Lauly 2012), the dataset is partitioned chronologically into 11,314
training documents and 7,531 testing documents.

Reuters Corpus Volume 1 (RCV1-v2): This dataset is an archive of
804,414 newswire stories produced by Reuters journalists between
August 20, 1996, and August 19, 1997 (Lewis et al. 2004)². RCV1-v2 has
been manually categorized into 103 topics, and the topic classes form
a tree which is typically of depth 3. As in (Hinton and Salakhutdinov
2009) and (Larochelle and Lauly 2012), the data was randomly split
into 794,414 training documents and 10,000 testing documents.

Baseline methods 

In the experiments, the proposed topic modeling approach is 
compared with several baseline methods, which we describe 
below: 

Latent Dirichlet Allocation (LDA): For the LDA model (Blei, Ng, and
Jordan 2003), we used the online variational inference implementation
in the gensim toolkit³. We used the recommended setting
$\alpha = 1/T$ for the Dirichlet prior parameter.

Hidden Topic Markov Models (HTMM): This model is proposed by
(Gruber, Weiss, and Rosen-Zvi 2007) and models the topics of words in
the document as a Markov chain. The HTMM model is run using the
publicly available code⁴. We use default settings for all
hyperparameters.

Over-Replicated Softmax (ORS): This model is proposed by (Srivastava,
Salakhutdinov, and Hinton 2013). It is a DBM model with two hidden
layers, which has been shown to obtain state-of-the-art performance on
classification and retrieval tasks compared with the Replicated
Softmax model (Hinton and Salakhutdinov 2009) and the DocNADE model
(Larochelle and Lauly 2012).

Implementation details 

In our GMNTM model, the learning rate $\alpha$ is set to 0.025 and
gradually reduced to 0.0001. For each word, at most $m = 6$ previous
words in the same sentence are used as the context. For easy
comparison with other models, the word vector size is set equal to the
number of topics, $V = T = 128$. Increasing the word vector size
further could improve the quality of the topics generated by the
GMNTM model.

Documents are split into sentences and words using the NLTK toolkit
(Bird 2006)⁵. The Gaussian mixture model is inferred using the
variational inference algorithm in the scikit-learn toolkit (Pedregosa
et al. 2011)⁶. To perform comparable experiments with a restricted
vocabulary, words outside the vocabulary are replaced by a special
token and are not counted in the word perplexity calculation.

¹ Available at http://qwone.com/~jason/20Newsgroups
² Available at http://trec.nist.gov/data/reuters/reuters.html
³ http://radimrehurek.com/gensim/models/ldamodel.html
⁴ http://code.google.com/p/openhtmm/downloads/list
⁵ http://www.nltk.org/
⁶ http://scikit-learn.org/

Data Set         LDA      HTMM     ORS      GMNTM
20 Newsgroups    1068     1013     949      933
RCV1-v2          1246     1039     982      826

Table 1: Comparison of test perplexity per word with 128 topics

Data Set         LDA      HTMM     ORS      GMNTM
20 Newsgroups    65.7%    66.5%    66.8%    73.1%
RCV1-v2          0.304    0.395    0.401    0.445

Table 2: Comparison of classification accuracy on 20 Newsgroups and
Mean Average Precision on Reuters RCV1-v2 with 128 topics

Generative model evaluation 

We first evaluate our model’s performance as a generative model for
documents. We perform our evaluation on the 20 Newsgroups dataset and
the RCV1-v2 dataset. For each of the datasets, we extract the words
from the raw data and preserve the ordering of words. We follow the
same evaluation as in (Srivastava, Salakhutdinov, and Hinton 2013),
comparing our model with the other models in terms of perplexity.

We estimate the log-probability for 1000 held-out documents that are
randomly sampled from the test sets. After running the algorithm to
infer the vector representations of words, sentences, and documents in
the held-out test documents, the average test perplexity per word is
estimated as $\exp\big(-\frac{1}{N} \sum_{w} \log p(w)\big)$, where
$N$ is the total number of words in the held-out test documents, and
$p(w)$ is calculated according to equation (4).
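
The perplexity estimate can be written in a few lines. This sketch assumes
that the per-word log-probabilities $\log p(w)$ have already been computed
from equation (4); the function name and the toy input are hypothetical.

```python
import math

def perplexity(word_log_probs):
    """exp(-(1/N) * sum_w log p(w)) over the N held-out word positions."""
    n = len(word_log_probs)
    return math.exp(-sum(word_log_probs) / n)

# Sanity check: a model that assigns probability 1/1000 to every word
# should have a perplexity of 1000.
uniform_log_probs = [math.log(1.0 / 1000.0)] * 50
value = perplexity(uniform_log_probs)
```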

Table 1 shows the perplexity for each dataset. The perplexity for
Over-Replicated Softmax is taken from (Srivastava, Salakhutdinov, and
Hinton 2013). As shown in Table 1, our model achieves lower perplexity
than the other models on both datasets. More specifically, relative to
Over-Replicated Softmax, the strongest baseline, the perplexity
decreases from 949 to 933 on the 20 Newsgroups data set and from 982
to 826 on the RCV1-v2 data set. This verifies the effectiveness of the
proposed topic modeling approach in fitting the data. The GMNTM model
works particularly well on large-scale datasets such as RCV1-v2.

Document retrieval evaluation 

To evaluate the quality of the document representations that are
learnt by our model, we perform an information retrieval task.
Following the setting in (Srivastava, Salakhutdinov, and Hinton 2013),
documents in the training set are used as a database, while the test
set is used as queries. For each query, documents in the database are
ranked by cosine similarity. The retrieval task is performed
separately for each label and the results are averaged. Figure 1
compares the precision-recall curves with 128 topics. The curves for
LDA and Over-Replicated Softmax are taken from (Srivastava,
Salakhutdinov, and Hinton 2013). We see that for the 20 Newsgroups
dataset, our model performs on par with or slightly better than the
other models, while for the RCV1-v2 dataset, our model achieves a
significant improvement. Since RCV1-v2 contains a greater amount of
text, the GMNTM model, which considers the ordering of words, is more
powerful in mining the semantics of the text.
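
The retrieval protocol can be sketched as follows: rank the database by
cosine similarity to the query vector and trace out precision-recall
points along the ranking. The function names and the toy vectors are
hypothetical, and a database document is treated as relevant when it
shares the query's label.

```python
import math

def cosine_similarity(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm > 0 else 0.0

def rank_database(query_vec, database_vecs):
    """Indices of database documents, most similar to the query first."""
    return sorted(range(len(database_vecs)),
                  key=lambda j: cosine_similarity(query_vec, database_vecs[j]),
                  reverse=True)

def precision_recall_points(ranking, database_labels, query_label):
    """(recall, precision) after each retrieved document."""
    relevant_total = sum(1 for l in database_labels if l == query_label)
    if relevant_total == 0:
        return []
    points, hits = [], 0
    for k, j in enumerate(ranking, start=1):
        if database_labels[j] == query_label:
            hits += 1
        points.append((hits / relevant_total, hits / k))
    return points

# Toy database of four document vectors with two labels.
database_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
database_labels = ["a", "a", "b", "b"]
ranking = rank_database([1.0, 0.2], database_vecs)
points = precision_recall_points(ranking, database_labels, "a")
```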


Document classification evaluation 

Following the evaluation of (Srivastava, Salakhutdinov, and Hinton
2013), we also perform document classification with the topic
representation learnt by our model. As in (Srivastava, Salakhutdinov,
and Hinton 2013), multinomial logistic regression with a cross-entropy
loss function is used for the 20 Newsgroups data set, and the
evaluation metric is the classification accuracy. For the RCV1-v2 data
set, we use independent logistic regression for each label.
The evaluation metric is Mean Average Precision.
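
For reference, Mean Average Precision can be computed as in the sketch
below. It uses the standard definition (average precision per label,
averaged over labels), which is an assumption here since the exact
averaging protocol is inherited from the cited evaluation; the function
names and toy scores are hypothetical.

```python
def average_precision(scores, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    hits, total = 0, 0.0
    for k, j in enumerate(order, start=1):
        if relevant[j]:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def mean_average_precision(per_label_scores, per_label_relevant):
    """Average of the per-label average precisions."""
    aps = [average_precision(s, r)
           for s, r in zip(per_label_scores, per_label_relevant)]
    return sum(aps) / len(aps)

# Toy example with two labels; scores are classifier outputs per document.
ap_first = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
map_value = mean_average_precision(
    [[0.9, 0.8, 0.7, 0.6], [0.1, 0.9]],
    [[True, False, True, False], [False, True]],
)
```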

We summarize the experimental results with 128 topics in Table 2.
The document classification results for LDA and Over-Replicated
Softmax are taken from (Srivastava, Salakhutdinov, and Hinton 2013).
According to Table 2, the proposed model substantially outperforms the
other models on both datasets for document classification. For the 20
Newsgroups dataset, the overall accuracy of the Over-Replicated
Softmax model is 66.8%, which is slightly higher than that of LDA and
HTMM. Our model further improves the classification accuracy to 73.1%.
On the RCV1-v2 dataset, we observe similar results: the mean average
precision increases from 0.401 (Over-Replicated Softmax) to 0.445 (our
model).

Qualitative inspection of topic specialization 

Since topic models are often used for the exploratory analysis of
unlabeled text, we also evaluate whether meaningful semantics are
captured by our model. Due to space limits, we only illustrate four
topics extracted by our model and by LDA, namely topics about
religion, space, sports and security. These topics also appear as
(sub)categories in the 20 Newsgroups dataset. Table 3 shows the 4
topics learnt by the GMNTM model and the corresponding topics learnt
by LDA. We visualize each topic using the 10 words with the largest
weights. The 4 topics shown in Table 3 are easy to interpret for both
models according to the top words. However, we see that the topics
found by the two models are different in nature. GMNTM finds topics
that consist of words that are consecutive in the document or words
that have similar semantics. For example, in the GMNTM model, “Christ”
and “Christian” share the same topic, mainly because they have a
strong semantic connection, even though they do not co-occur very
often, which makes LDA unable to put them in the same topic. On the
other hand, LDA often finds general words such as “would” and “accept”
for the religion topic, which are unhelpful for interpreting the
topics.





Figure 1: Document Retrieval Evaluation (precision-recall curves;
panels: 20 Newsgroups and Reuters RCV1-v2; horizontal axis: Recall)


GMNTM Topic Words                               | LDA Topic Words
god         space       game        key         | god         space       year        key
jesus       orbit       play        public      | believe     nasa        hockey      encryption
christ      earth       season      encryption  | jesus       research    team        use
believe     solar       team        security    | sin         center      division    des
christian   spacecraft  win         escrow      | one         shuttle     league      system
bible       surface     hockey      secure      | mary        launch      nhl         rsa
lord        planet      hand        data        | lord        station     last        public
truth       mission     series      privacy     | would       orbit       think       security
sin         satellite   chance      government  | christian   april       maria       nsa
faith       shuttle     nhl         nsa         | accept      satellite   see         secure

Table 3: Topic words


Conclusion and Future Work 

Rather than ignoring the semantics of words and assuming that the
topic distribution within a document is conditionally independent, in
this paper we introduce an ordering-sensitive and semantic-aware topic
modeling approach. The GMNTM model jointly learns the topics of
documents and the vectorized representations of words, sentences and
documents. The model learns better topics and disambiguates words that
belong to different topics. Compared with state-of-the-art topic
modeling approaches, GMNTM achieves better perplexity, retrieval
accuracy and classification accuracy.

In future work, we will explore using non-parametric models to
cluster word vectors. For example, we look forward to incorporating
an infinite Dirichlet process to automatically detect the number of
topics. We can also use a hierarchical model to further capture the
subtle semantics of the text. As another promising direction, we
consider building topic models on popular neural probabilistic
methods, such as the Recurrent Neural Network Language Model (RNNLM).
The GMNTM model has applications to several tasks in natural language
processing, including entity recognition, information extraction and
sentiment analysis. These applications also deserve further study.